Solving a Maze using RL

Author

Michael Hahsler

Install and Load Package markovDP

Install the current development version of [markovDP](https://mhahsler.r-universe.dev/markovDP).

if (!"markovDP" %in% rownames(installed.packages())) 
  install.packages('markovDP', repos = c(
    'https://mhahsler.r-universe.dev', 
    'https://cloud.r-project.org'))

library(markovDP)

Read a Maze

Mazes are so-called gridworlds. The markovDP package already comes with the mazes we use in class.

maze_dir <- system.file("mazes", package = "markovDP")
dir(maze_dir)
[1] "empty_2_maze.txt" "empty_maze.txt"   "L_maze.txt"       "large_maze.txt"  
[5] "loops_maze.txt"   "medium_maze.txt"  "open_maze.txt"    "small_maze.txt"  
[9] "wall_maze.txt"   
maze <- gw_read_maze(file.path(maze_dir, "L_maze.txt"))
maze
MDP, list - Maze
  Discount factor: 1
  Horizon: Inf epochs
  Size: 158 states / 4 actions
  Start: s(10,6)

  List components: 'name', 'discount', 'horizon', 'states', 'actions',
    'start', 'transition_prob', 'reward', 'info', 'absorbing_states'
gw_plot(maze)

The transition model is stored as a function.

maze$transition_prob
function(model, action, start.state) {
  P <- structure(numeric(length(S(model))), names = S(model))
  
  ai <- match(action, A(model))
  
  # stay in place for unknown actions
  if (is.na(ai)) {
    warning("Unknown action", action)
    P[start.state] <- 1
    return(P)
  }
  
  # absorbing states
  if (start.state %in% model$info$absorbing_states) {
    P[start.state] <- 1
    return(P)
  }
  
  # move
  rc <- gw_s2rc(start.state)
  rc <- switch(ai, rc + c(-1, 0), rc + c(0, +1), rc + c(+1, 0), rc + c(0, -1), )
  
  es <- gw_rc2s(rc)
  
  # stay in place if we would leave the gridworld
  if (!(es %in% S(model))) {
    es <- start.state
  }
  
  P[es] <- 1
  P
}
<bytecode: 0x612360f0c7e8>
<environment: namespace:markovDP>
gw_plot_transition_graph(maze)
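Since the transition model is simply an R function with signature (model, action, start.state), we can also query it directly. For example, the following sketch returns the probability vector over end states for taking action "up" in the start state (output omitted):

# transition probabilities for action "up" taken in the start state
p <- maze$transition_prob(maze, "up", "s(10,6)")
p[p > 0]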

The reward model

maze$reward
  action start.state end.state value
1   <NA>        <NA>      <NA>    -1
2   <NA>        <NA>   s(3,13)   100
3   <NA>     s(3,13)      <NA>     0

The reward model assigns a cost of 1 for each transition, a reward of +100 for reaching the goal state s(3,13), and a reward of 0 for transitions out of the absorbing goal state. The best solution is therefore the shortest path to the goal.

Recall that tree search starts at the start state and explores the state space until it finds the goal state.

Solve the Maze MDP as a Planning Agent

We solve the MDP directly using the problem description as a model for the environment. This approach would be used by a planning agent similar to the tree search algorithms.

Since the agent has access to a complete model of the environment, we can use a model-based algorithm. Here we use value iteration. In each iteration, the algorithm sweeps through all states and updates each state value with the immediate reward plus the utility of the best state it could reach. It stops when the state values no longer change by more than a set error threshold.
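In standard notation, each sweep applies the Bellman optimality update to every state \(s\):

\[ V_{k+1}(s) \leftarrow \max_a \sum_{s'} P(s' \mid s, a)\left[R(s, a, s') + \gamma\, V_k(s')\right] \]

where \(\gamma\) is the discount factor (here 1, since the maze is an undiscounted shortest-path problem).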

system.time(sol <- solve_MDP(maze, method = "value_iteration"))
   user  system elapsed 
  0.096   0.001   0.097 
sol
MDP, list - Maze
  Discount factor: 1
  Horizon: Inf epochs
  Size: 158 states / 4 actions
  Start: s(10,6)
  Solved:
    Method: 'value_iteration'
    Solution converged: TRUE

  List components: 'name', 'discount', 'horizon', 'states', 'actions',
    'start', 'transition_prob', 'reward', 'info', 'absorbing_states',
    'solution'
gw_plot(sol)

gw_path(sol)
Warning in sample_MDP(model, n = 1, horizon = horizon, epsilon = 0, trajectories = TRUE): Simulation for undiscounted problems need a finite simulation horizon.
Using a maximum horizon of |S| x |A| = 632 to avoid infinite loops.
$path
   episode time       s     a   r s_prime
1        1    0 s(10,6)    up  -1  s(9,6)
2        1    1  s(9,6) right  -1  s(9,7)
3        1    2  s(9,7) right  -1  s(9,8)
4        1    3  s(9,8) right  -1  s(9,9)
5        1    4  s(9,9) right  -1 s(9,10)
6        1    5 s(9,10) right  -1 s(9,11)
7        1    6 s(9,11) right  -1 s(9,12)
8        1    7 s(9,12) right  -1 s(9,13)
9        1    8 s(9,13)    up  -1 s(8,13)
10       1    9 s(8,13)    up  -1 s(7,13)
11       1   10 s(7,13)    up  -1 s(6,13)
12       1   11 s(6,13)    up  -1 s(5,13)
13       1   12 s(5,13)    up  -1 s(4,13)
14       1   13 s(4,13)    up 100 s(3,13)

$reward
[1] 87

$solved
[1] TRUE

The optimal path reaches the goal in 14 steps: 13 transitions cost 1 each and the final transition into the goal earns +100, for a total reward of 100 - 13 = 87. In the plot above, the color represents the state value. Here are the learned state values and the optimal action for each state (the policy).

policy(sol)
      states   V action
1     s(2,2)  89   down
2     s(3,2)  90  right
3     s(4,2)  89  right
4     s(5,2)  88  right
5     s(6,2)  87     up
6     s(7,2)  86  right
7     s(8,2)  85     up
8     s(9,2)  84     up
9    s(10,2)  83     up
10   s(11,2)  82  right
11   s(12,2)  81     up
12   s(13,2)  80     up
13   s(14,2)  79  right
14    s(2,3)  90  right
15    s(3,3)  91  right
16    s(4,3)  90     up
17    s(5,3)  89     up
18    s(6,3)  88     up
19    s(7,3)  87     up
20    s(8,3)  86     up
21    s(9,3)  85  right
22   s(10,3)  84     up
23   s(11,3)  83     up
24   s(12,3)  82  right
25   s(13,3)  81  right
26   s(14,3)  80     up
27    s(2,4)  91  right
28    s(3,4)  92  right
29    s(4,4)  91     up
30    s(6,4)  87   left
31    s(7,4)  86     up
32    s(8,4)  85     up
33    s(9,4)  86  right
34   s(10,4)  85  right
35   s(11,4)  84  right
36   s(12,4)  83     up
37   s(13,4)  82     up
38   s(14,4)  81  right
39    s(2,5)  92  right
40    s(3,5)  93  right
41    s(4,5)  92  right
42    s(6,5)  86   left
43    s(7,5)  85  right
44    s(8,5)  86   down
45    s(9,5)  87  right
46   s(10,5)  86  right
47   s(11,5)  85     up
48   s(12,5)  84  right
49   s(13,5)  83  right
50   s(14,5)  82  right
51    s(2,6)  93  right
52    s(3,6)  94  right
53    s(4,6)  93  right
54    s(6,6)  85   down
55    s(7,6)  86  right
56    s(8,6)  87   down
57    s(9,6)  88  right
58   s(10,6)  87     up
59   s(11,6)  86     up
60   s(12,6)  85  right
61   s(13,6)  84  right
62   s(14,6)  83  right
63    s(2,7)  94   down
64    s(3,7)  95  right
65    s(4,7)  94  right
66    s(6,7)  86   down
67    s(7,7)  87  right
68    s(8,7)  88   down
69    s(9,7)  89  right
70   s(10,7)  88     up
71   s(11,7)  87  right
72   s(12,7)  86  right
73   s(13,7)  85     up
74   s(14,7)  84     up
75    s(2,8)  95  right
76    s(3,8)  96  right
77    s(4,8)  95     up
78    s(6,8)  87   down
79    s(7,8)  88  right
80    s(8,8)  89   down
81    s(9,8)  90  right
82   s(10,8)  89  right
83   s(11,8)  88     up
84   s(12,8)  87     up
85   s(13,8)  86     up
86   s(14,8)  85     up
87    s(2,9)  96  right
88    s(3,9)  97  right
89    s(4,9)  96     up
90    s(6,9)  88   down
91    s(7,9)  89   down
92    s(8,9)  90  right
93    s(9,9)  91  right
94   s(10,9)  90     up
95   s(11,9)  89  right
96   s(12,9)  88     up
97   s(13,9)  87  right
98   s(14,9)  86  right
99   s(2,10)  97  right
100  s(3,10)  98  right
101  s(4,10)  97     up
102  s(6,10)  89   down
103  s(7,10)  90   down
104  s(8,10)  91   down
105  s(9,10)  92  right
106 s(10,10)  91  right
107 s(11,10)  90  right
108 s(12,10)  89  right
109 s(13,10)  88  right
110 s(14,10)  87     up
111  s(2,11)  98   down
112  s(3,11)  99  right
113  s(4,11)  98  right
114  s(9,11)  93  right
115 s(10,11)  92  right
116 s(11,11)  91  right
117 s(12,11)  90     up
118 s(13,11)  89  right
119 s(14,11)  88     up
120  s(2,12)  99  right
121  s(3,12) 100  right
122  s(4,12)  99  right
123  s(5,12)  98  right
124  s(6,12)  97     up
125  s(7,12)  96     up
126  s(8,12)  95  right
127  s(9,12)  94  right
128 s(10,12)  93  right
129 s(11,12)  92     up
130 s(12,12)  91     up
131 s(13,12)  90  right
132 s(14,12)  89  right
133  s(2,13) 100   down
134  s(3,13)   0  right
135  s(4,13) 100     up
136  s(5,13)  99     up
137  s(6,13)  98     up
138  s(7,13)  97     up
139  s(8,13)  96     up
140  s(9,13)  95     up
141 s(10,13)  94     up
142 s(11,13)  93     up
143 s(12,13)  92     up
144 s(13,13)  91     up
145 s(14,13)  90     up
146  s(2,14)  99   left
147  s(3,14) 100   left
148  s(4,14)  99     up
149  s(5,14)  98   left
150  s(6,14)  97   left
151  s(7,14)  96   left
152  s(8,14)  95     up
153  s(9,14)  94     up
154 s(10,14)  93     up
155 s(11,14)  92   left
156 s(12,14)  91   left
157 s(13,14)  90   left
158 s(14,14)  89     up

Solve Step-by-step

gw_animate(maze, "value_iteration", n = 25, zlim = c(-1,100))

Reinforcement Learning for Unknown Mazes

An unknown environment means that we do not have a model of the environment. The agent does not know the transition function or the reward structure and has to learn a good policy, and at least parts of the transition and reward models, from interaction with the environment.

Here we assume that the agent has no direct access to the MDP. The agent can only try actions, and the environment uses the MDP to simulate how it reacts. Q-learning is a model-free reinforcement learning approach.
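After observing a transition \((s, a, r, s')\), Q-learning applies the standard update

\[ Q(s, a) \leftarrow Q(s, a) + \alpha\left[r + \gamma \max_{a'} Q(s', a') - Q(s, a)\right] \]

where \(\alpha\) is the learning rate and \(\gamma\) the discount factor.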

system.time(sol <- solve_MDP(maze, method ="q_learning", 
                             horizon = 1000, n = 100, alpha = 0.3))
   user  system elapsed 
  0.954   0.006   0.959 
sol
MDP, list - Maze
  Discount factor: 1
  Horizon: Inf epochs
  Size: 158 states / 4 actions
  Start: s(10,6)
  Solved:
    Method: 'q_learning'
    Solution converged: NA

  List components: 'name', 'discount', 'horizon', 'states', 'actions',
    'start', 'transition_prob', 'reward', 'info', 'absorbing_states',
    'solution'

These methods are horribly data-inefficient, but they do not need a model of the environment and do not have to calculate state values for all states.
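The greedy policy extracted from the learned Q-values can be inspected just like before (only the first few states shown; output omitted):

head(policy(sol))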

gw_plot(sol)

Step-by-step Solution using Q-Learning

Q-learning follows an \(\epsilon\)-greedy behavior policy: it takes the currently best action, but with probability \(\epsilon\) it chooses a random action to explore. Initially, the agent has no good policy and essentially performs a random walk.
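As a minimal sketch (not the package's implementation), \(\epsilon\)-greedy action selection from a state-by-action matrix of Q-values could look like this:

# pick a random action with probability epsilon (exploration),
# otherwise pick the action with the highest current Q-value (exploitation)
choose_action <- function(Q, s, actions, epsilon = 0.2) {
  if (runif(1) < epsilon)
    sample(actions, 1)
  else
    actions[which.max(Q[s, actions])]
}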

This is what such a random walk looks like.

set.seed(1111)
sample_MDP(maze, n = 1, horizon = 50, trajectories = TRUE)$trajectories
   episode time       s     a  r s_prime
1        1    0 s(10,6) right -1 s(10,7)
2        1    1 s(10,7)    up -1  s(9,7)
3        1    2  s(9,7) right -1  s(9,8)
4        1    3  s(9,8)  left -1  s(9,7)
5        1    4  s(9,7)    up -1  s(8,7)
6        1    5  s(8,7)  left -1  s(8,6)
7        1    6  s(8,6)  down -1  s(9,6)
8        1    7  s(9,6)    up -1  s(8,6)
9        1    8  s(8,6)    up -1  s(7,6)
10       1    9  s(7,6) right -1  s(7,7)
11       1   10  s(7,7) right -1  s(7,8)
12       1   11  s(7,8)  left -1  s(7,7)
13       1   12  s(7,7) right -1  s(7,8)
14       1   13  s(7,8)  left -1  s(7,7)
15       1   14  s(7,7)  down -1  s(8,7)
16       1   15  s(8,7)  left -1  s(8,6)
17       1   16  s(8,6)    up -1  s(7,6)
18       1   17  s(7,6)    up -1  s(6,6)
19       1   18  s(6,6) right -1  s(6,7)
20       1   19  s(6,7) right -1  s(6,8)
21       1   20  s(6,8)  left -1  s(6,7)
22       1   21  s(6,7)  left -1  s(6,6)
23       1   22  s(6,6)    up -1  s(6,6)
24       1   23  s(6,6) right -1  s(6,7)
25       1   24  s(6,7) right -1  s(6,8)
26       1   25  s(6,8)  left -1  s(6,7)
27       1   26  s(6,7)  down -1  s(7,7)
28       1   27  s(7,7)  left -1  s(7,6)
29       1   28  s(7,6) right -1  s(7,7)
30       1   29  s(7,7)    up -1  s(6,7)
31       1   30  s(6,7)  left -1  s(6,6)
32       1   31  s(6,6) right -1  s(6,7)
33       1   32  s(6,7)  down -1  s(7,7)
34       1   33  s(7,7)  left -1  s(7,6)
35       1   34  s(7,6)  down -1  s(8,6)
36       1   35  s(8,6)    up -1  s(7,6)
37       1   36  s(7,6)  down -1  s(8,6)
38       1   37  s(8,6)  down -1  s(9,6)
39       1   38  s(9,6)  left -1  s(9,5)
40       1   39  s(9,5)    up -1  s(8,5)
41       1   40  s(8,5) right -1  s(8,6)
42       1   41  s(8,6)  down -1  s(9,6)
43       1   42  s(9,6)  down -1 s(10,6)
44       1   43 s(10,6)  down -1 s(11,6)
45       1   44 s(11,6)  left -1 s(11,5)
46       1   45 s(11,5)  left -1 s(11,4)
47       1   46 s(11,4)  left -1 s(11,3)
48       1   47 s(11,3)  down -1 s(12,3)
49       1   48 s(12,3)    up -1 s(11,3)
50       1   49 s(11,3)  left -1 s(11,2)

After 50 steps, the agent still has not found the goal, which is a problem. This is why I use a horizon of 1000, in the hope that the agent will randomly run into the goal. After some experimentation, I settled on an \(\epsilon\) of \(0.2\), which means every 5th action is chosen randomly, and a relatively high, fixed learning rate \(\alpha\) of \(0.3\).

sol <- gw_animate(maze, "q_learning", n = 25, zlim = c(-1,100), 
                  horizon = 1000, epsilon = 0.2, alpha = 0.3)